Building ScrapboxTranslator v1 with ChatGPT

Using ChatGPT to write "code that translates using ChatGPT". blu3mo.icon

Result: ScrapboxTranslator v1

-.icon

blu3mo.icon

code:prompt

I want to write Python code that takes a JSON file of Japanese documents and outputs an English-translated JSON file.

# JSON file format

{
  "pages": [
    {
      "title": "page1title",
      "lines": [
        "page1title",
        "line2",
        "line3"
      ]
    },
    {
      "title": "page2title",
      "lines": [
        "page2title",
        "line2",
        "line3"
      ]
    },
    ...
  ]
}

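# What to do

1. Receive the JSON file.

2. Extract only the titles of all pages, translate them into English, and build a dictionary that pairs the Japanese and English titles.

3. Translate the body (lines) of each page. Note: the body contains links to other pages, written as the page title enclosed in brackets. Those titles must be translated exactly, word for word, as the English translations in the dictionary.

# How to translate

Use the ChatGPT API. A sample input and output for the ChatGPT API is shown below. Tell the assistant to translate into English, and use its output. If there are conditions or caveats for the translation, tell it those as well.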

`sample call

# Note: you need to be using OpenAI Python v0.27.0 for the code below to work
import openai

openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who won the world series in 2020?"},
    {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
    {"role": "user", "content": "Where was it played?"}
  ]
)

`sample response

{
  'id': 'chatcmpl-6p9XYPYSTTRi0xEviKjjilqrWU2Ve',
  'object': 'chat.completion',
  'created': 1677649420,
  'model': 'gpt-3.5-turbo',
  'usage': {'prompt_tokens': 56, 'completion_tokens': 31, 'total_tokens': 87},
  'choices': [
    {
      'message': {
        'role': 'assistant',
        'content': 'The 2020 World Series was played in Arlington, Texas at the Globe Life Field, which was the new home stadium for the Texas Rangers.'},
      'finish_reason': 'stop',
      'index': 0
    }
  ]
}

chatgpt.icon

code: python

import json
import openai

def translate(text, role="user"):
    # Send one text to the chat API and return the assistant's reply.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that translates Japanese text to English."},
            {"role": role, "content": text}
        ]
    )
    return response.choices[0].message.content

def translate_json_file(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)

    title_translation_dict = {}

    # Translate page titles
    for page in data['pages']:
        translated_title = translate(page['title'])
        title_translation_dict[page['title']] = translated_title
        page['title'] = translated_title

    # Translate lines with translated titles replaced
    for page in data['pages']:
        translated_lines = []
        for line in page['lines']:
            # Swap bracketed links to their English titles before translating
            for jp_title, en_title in title_translation_dict.items():
                line = line.replace(f"[{jp_title}]", f"[{en_title}]")
            translated_line = translate(line)
            translated_lines.append(translated_line)
        page['lines'] = translated_lines

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

# Replace 'input.json' and 'output.json' with your desired file paths
translate_json_file('input.json', 'output.json')

I think that probably did it. blu3mo.icon

So it replaced the links before translating the body. If that works, it is certainly simpler.

I'll verify it later.
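As a rough illustration of the "replace links before translating" idea, here is a minimal sketch. It assumes links are written as a page title wrapped in square brackets; `replace_links` and the sample titles are hypothetical names, not part of the generated code.

```python
import re

def replace_links(line, title_dict):
    """Swap [Japanese title] links for [English title] using title_dict.

    Assumes links are the page title wrapped in square brackets;
    bracketed text that is not a known title is left untouched.
    """
    def swap(match):
        inner = match.group(1)
        return "[" + title_dict.get(inner, inner) + "]"
    return re.sub(r"\[([^\[\]]+)\]", swap, line)

titles = {"りんご": "Apple", "りんごジュース": "Apple Juice"}
print(replace_links("[りんご]と[りんごジュース]", titles))
# prints: [Apple]と[Apple Juice]
```

Matching the whole bracketed link in one pass means text outside brackets is never touched, and a title that has already been swapped is not replaced a second time.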

Thanks. As an improvement, when the documents are updated I'd like to re-translate only the updated parts. What would be a good way to implement that? blu3mo.icon

chatgpt.icon

code: python

import json
import openai

# NOTE: translate() is the function defined in the previous snippet.

def has_changes(new_line, old_line):
    return new_line != old_line

def translate_updated_json(input_file, old_input_file, translated_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        new_data = json.load(f)
    with open(old_input_file, 'r', encoding='utf-8') as f:
        old_data = json.load(f)
    with open(translated_file, 'r', encoding='utf-8') as f:
        translated_data = json.load(f)

    # Pages are paired by position, so all three files must list the same pages in the same order
    for new_page, old_page, translated_page in zip(new_data['pages'], old_data['pages'], translated_data['pages']):
        if has_changes(new_page['title'], old_page['title']):
            translated_title = translate(new_page['title'])
            translated_page['title'] = translated_title

        for idx, (new_line, old_line) in enumerate(zip(new_page['lines'], old_page['lines'])):
            if has_changes(new_line, old_line):
                translated_line = translate(new_line)
                translated_page['lines'][idx] = translated_line

    with open(translated_file, 'w', encoding='utf-8') as f:
        json.dump(translated_data, f, ensure_ascii=False, indent=2)

# Replace the file paths with your desired file paths
translate_updated_json('input_new.json', 'input_old.json', 'output.json')

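This implementation assumes that the JSON files have the same number of pages and the same structure. If the file structure changes significantly, or if pages are added or deleted, it may not work correctly. To handle such scenarios, a more robust... chatgpt.icon

I think the file structure will stay the same, but pages are quite likely to be added or deleted. Please handle that. blu3mo.icon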

chatgpt.icon

code: python

import json
import openai

# NOTE: translate() is the function defined in the first snippet.

def has_changes(new_line, old_line):
    return new_line != old_line

def translate_updated_json(input_file, old_input_file, translated_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        new_data = json.load(f)
    with open(old_input_file, 'r', encoding='utf-8') as f:
        old_data = json.load(f)
    with open(translated_file, 'r', encoding='utf-8') as f:
        translated_data = json.load(f)

    old_pages_dict = {page['title']: page for page in old_data['pages']}
    # Key the translated pages by the original (Japanese) title. This assumes
    # translated_file lists its pages in the same order as old_input_file;
    # keying by the translated title would never match new_title below.
    translated_pages_dict = {
        old_page['title']: translated_page
        for old_page, translated_page in zip(old_data['pages'], translated_data['pages'])
    }

    new_translated_pages = []
    for new_page in new_data['pages']:
        new_title = new_page['title']
        if new_title not in old_pages_dict:  # Page is added
            translated_page = {
                'title': translate(new_title),
                'lines': [translate(line) for line in new_page['lines']]
            }
        else:  # Page exists in the old data
            old_page = old_pages_dict[new_title]
            translated_page = translated_pages_dict[new_title]

            if has_changes(new_page['title'], old_page['title']):
                translated_page['title'] = translate(new_page['title'])

            # zip() stops at the shorter list, so appended lines are not picked up here
            for idx, (new_line, old_line) in enumerate(zip(new_page['lines'], old_page['lines'])):
                if has_changes(new_line, old_line):
                    translated_page['lines'][idx] = translate(new_line)

        new_translated_pages.append(translated_page)

    # Deleted pages are dropped because only pages present in new_data are kept
    translated_data['pages'] = new_translated_pages

    with open(translated_file, 'w', encoding='utf-8') as f:
        json.dump(translated_data, f, ensure_ascii=False, indent=2)

# Replace the file paths with your desired file paths
translate_updated_json('input_new.json', 'input_old.json', 'output.json')
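To make the added/deleted/changed cases easier to see, here is a small sketch that only classifies pages and calls no API. It assumes page titles are unique and stable between exports; `diff_pages` is a hypothetical helper, not part of the code above.

```python
def diff_pages(old_pages, new_pages):
    """Classify page titles as added, removed, or changed between two exports."""
    old = {p["title"]: p["lines"] for p in old_pages}
    new = {p["title"]: p["lines"] for p in new_pages}
    added = [t for t in new if t not in old]
    removed = [t for t in old if t not in new]
    changed = [t for t in new if t in old and new[t] != old[t]]
    return added, removed, changed

old_pages = [{"title": "A", "lines": ["A", "x"]}, {"title": "B", "lines": ["B"]}]
new_pages = [{"title": "A", "lines": ["A", "y"]}, {"title": "C", "lines": ["C"]}]
print(diff_pages(old_pages, new_pages))
# prints: (['C'], ['B'], ['A'])
```

Only the pages in `added` and `changed` would need to go back to the API; a renamed page shows up as one removal plus one addition under this assumption.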

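Before handling diffs, I should first build something that simply translates. blu3mo.icon

blu3mo.icon (went back in the conversation history and edited)

Thanks.

The code above calls translate() for every single title and line, but each call adds to the API cost, so I want to minimize the number of calls. However, the maximum number of characters that can be passed to ChatCompletion is around 8000.

When translating titles, translate as many as fit within the maximum character count in a single call.

When translating each page, translate the whole page in a single call. If it hits the character limit, split it as needed.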

chatgpt.icon

code: python

import json
import openai

# Despite the name, this is compared against len(text), i.e. characters, not tokens
MAX_TOKENS = 8000

def translate(text, role="user"):
    # NOTE: the callers below pass custom role names, which are not valid API roles;
    # this is corrected later in this log (role must be system or user).
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that translates Japanese text to English."},
            {"role": role, "content": text}
        ]
    )
    return response.choices[0].message.content

def translate_titles(title_list):
    translated_titles = []
    title_chunk = ""
    for title in title_list:
        if len(title_chunk) + len(title) + 1 < MAX_TOKENS:
            title_chunk += title + "\n"
        else:
            translated_chunk = translate(title_chunk, role="title_translation")
            # strip() so a missing trailing newline in the reply does not drop the last title
            translated_titles.extend(translated_chunk.strip().split("\n"))
            title_chunk = title + "\n"

    if title_chunk:
        translated_chunk = translate(title_chunk, role="title_translation")
        translated_titles.extend(translated_chunk.strip().split("\n"))

    return translated_titles

def translate_page(page_text):
    if len(page_text) <= MAX_TOKENS:
        return translate(page_text, role="page_translation")
    else:
        # Split at the last newline before the limit and translate each part
        split_point = page_text.rfind("\n", 0, MAX_TOKENS)
        first_half = page_text[:split_point]
        second_half = page_text[split_point + 1:]
        return translate_page(first_half) + "\n" + translate_page(second_half)

def translate_json_file(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)

    title_translation_dict = {}

    # Translate page titles
    title_list = [page['title'] for page in data['pages']]
    translated_titles = translate_titles(title_list)

    for original_title, translated_title in zip(title_list, translated_titles):
        title_translation_dict[original_title] = translated_title

    for page, translated_title in zip(data['pages'], translated_titles):
        page['title'] = translated_title

    # Translate lines with translated titles replaced
    for page in data['pages']:
        page_text = "\n".join(page['lines'])
        for jp_title, en_title in title_translation_dict.items():
            page_text = page_text.replace(f"[{jp_title}]", f"[{en_title}]")
        translated_text = translate_page(page_text)
        page['lines'] = translated_text.split("\n")

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

# Replace 'input.json' and 'output.json' with your desired file paths
translate_json_file('input.json', 'output.json')
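The batching idea (pack as many lines as fit under a character limit into one request) can be sketched without recursion. This is only an illustration under assumptions: `chunk_lines` is a hypothetical helper, the limit counts characters rather than tokens, and a single overlong line is kept whole instead of being cut.

```python
def chunk_lines(lines, max_chars):
    """Group lines into newline-joined chunks of at most max_chars characters.

    A single line longer than max_chars becomes its own oversized chunk
    rather than being cut mid-line (an assumption of this sketch).
    """
    chunks, current, size = [], [], 0
    for line in lines:
        extra = len(line) + (1 if current else 0)  # +1 for the joining "\n"
        if current and size + extra > max_chars:
            chunks.append("\n".join(current))
            current, size = [], 0
            extra = len(line)
        current.append(line)
        size += extra
    if current:
        chunks.append("\n".join(current))
    return chunks

lines = ["aaaa", "bbbb", "cc", "dddddddddddd"]
print(chunk_lines(lines, 9))
# prints: ['aaaa\nbbbb', 'cc', 'dddddddddddd']
```

Because chunks are built only at line boundaries, joining them back with newlines reproduces the original lines, and the loop always terminates even when the text contains no newline at all.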

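I tried it and found a few problems. blu3mo.icon

Sometimes GPT summarizes the page on its own.

Maybe I should make GPT translate more mechanically.

Something like passing JSON in and getting JSON back.

No, getting it to follow JSON exactly is actually harder. blu3mo.icon

Bullet lists get broken.

For this, I can probably just instruct it not to break them.

https://gyazo.com/dde921c73a3bfbc062fa8e808cfece0c

Adjusted the prompt in the playground. blu3mo.icon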

Thanks. Right now I think ChatCompletion is being called synchronously; please make this parallel and asynchronous. blu3mo.icon

chatgpt.icon

code: python

import json
import asyncio
import aiohttp
import openai

# Despite the name, this is compared against len(text), i.e. characters, not tokens
MAX_TOKENS = 8000
INPUT_PATH = "input_json/test1.json"
OUTPUT_PATH = "output_json/test1_2.json"

PROMPT = """
You are a translator.

# Task
You get multiple texts to translate. Translate texts to English, and return translated texts.

# Rules
Keep the number of lines and newlines. Never remove spaces at the beginning of each line. Keep the number of spaces the same.
Brackets of [text] and [text.icon] must be kept. The content inside the bracket must never be changed.
"""

async def async_translate(session, text, role="user"):
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {openai.api_key}"
    }
    data = {
        "model": "gpt-3.5-turbo",
        "messages": [
            {"role": "system", "content": PROMPT},
            {"role": role, "content": text}
        ]
    }

    async with session.post("https://api.openai.com/v1/chat/completions", headers=headers, json=data) as resp:
        response = await resp.json()
        print(response)
        return response["choices"][0]["message"]["content"]

async def translate_titles(session, title_list):
    translated_titles = []
    title_chunk = ""
    for title in title_list:
        if len(title_chunk) + len(title) + 1 < MAX_TOKENS:
            title_chunk += title + "\n"
        else:
            translated_chunk = await async_translate(session, title_chunk)
            # strip() so a missing trailing newline in the reply does not drop the last title
            translated_titles.extend(translated_chunk.strip().split("\n"))
            title_chunk = title + "\n"

    if title_chunk:
        translated_chunk = await async_translate(session, title_chunk)
        translated_titles.extend(translated_chunk.strip().split("\n"))

    return translated_titles

async def translate_page(session, page_text):
    if len(page_text) <= MAX_TOKENS:
        return await async_translate(session, page_text)
    else:
        split_point = page_text.rfind("\n", 0, MAX_TOKENS)
        first_half = page_text[:split_point]
        second_half = page_text[split_point + 1:]
        first_half_translated = await async_translate(session, first_half)
        second_half_translated = await translate_page(session, second_half)
        return first_half_translated + "\n" + second_half_translated

async def translate_json_file(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)

    title_translation_dict = {}

    # Translate page titles
    title_list = [page['title'] for page in data['pages']]

    async with aiohttp.ClientSession() as session:
        translated_titles = await translate_titles(session, title_list)

        for original_title, translated_title in zip(title_list, translated_titles):
            title_translation_dict[original_title] = translated_title

        for page, translated_title in zip(data['pages'], translated_titles):
            page['title'] = translated_title

        # Translate lines with translated titles replaced
        # (each page is awaited in turn, so the requests still run one at a time)
        for page in data['pages']:
            page_text = "\n".join(page['lines'])
            for jp_title, en_title in title_translation_dict.items():
                page_text = page_text.replace(f"[{jp_title}]", f"[{en_title}]")
            translated_text = await translate_page(session, page_text)
            page['lines'] = translated_text.split("\n")

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

# Replace INPUT_PATH and OUTPUT_PATH above with your desired file paths
async def main():
    await translate_json_file(INPUT_PATH, OUTPUT_PATH)

asyncio.run(main())

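(Part of the API usage was wrong, so blu3mo.icon fixed that by hand.)

This is still slow. blu3mo.icon

Wait, looking closely, this isn't actually asynchronous.

In the end it just calls await inside a for loop.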
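The difference is easy to see with a stand-in coroutine. In this minimal sketch, `fake_translate` is a hypothetical replacement for the real API call, and the log records the order in which calls start and finish.

```python
import asyncio

async def fake_translate(text, log):
    """Stand-in for an API call (hypothetical); records start and end order."""
    log.append(("start", text))
    await asyncio.sleep(0.01)
    log.append(("end", text))
    return text.upper()

async def sequential(texts, log):
    # Each await finishes before the next call starts.
    return [await fake_translate(t, log) for t in texts]

async def concurrent(texts, log):
    # All coroutines are scheduled first, then awaited together.
    return await asyncio.gather(*(fake_translate(t, log) for t in texts))

seq_log, con_log = [], []
seq_result = asyncio.run(sequential(["a", "b"], seq_log))
con_result = asyncio.run(concurrent(["a", "b"], con_log))
print([kind for kind, _ in seq_log])  # ['start', 'end', 'start', 'end']
print([kind for kind, _ in con_log])  # ['start', 'start', 'end', 'end']
```

Awaiting inside the loop serializes the calls even though the function is `async`; `asyncio.gather` lets the waits overlap and still returns results in input order.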

blu3mo.icon

Thanks. One correction: role should always be system or user.

Also, I'd like each translate_page call to run concurrently.

chatgpt.icon

code: python

import json
import asyncio
import aiohttp
import openai

# Despite the name, this is compared against len(text), i.e. characters, not tokens
MAX_TOKENS = 8000

async def async_translate(session, text):
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {openai.api_key}"
    }
    data = {
        "model": "gpt-3.5-turbo",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant that translates Japanese text to English."},
            {"role": "user", "content": text}
        ]
    }

    async with session.post("https://api.openai.com/v1/chat/completions", headers=headers, json=data) as resp:
        response = await resp.json()
        return response["choices"][0]["message"]["content"]

async def translate_titles(session, title_list):
    translated_titles = []
    title_chunk = ""
    for title in title_list:
        if len(title_chunk) + len(title) + 1 < MAX_TOKENS:
            title_chunk += title + "\n"
        else:
            translated_chunk = await async_translate(session, title_chunk)
            # strip() so a missing trailing newline in the reply does not drop the last title
            translated_titles.extend(translated_chunk.strip().split("\n"))
            title_chunk = title + "\n"

    if title_chunk:
        translated_chunk = await async_translate(session, title_chunk)
        translated_titles.extend(translated_chunk.strip().split("\n"))

    return translated_titles

async def translate_page(session, page_text):
    if len(page_text) <= MAX_TOKENS:
        return await async_translate(session, page_text)
    else:
        split_point = page_text.rfind("\n", 0, MAX_TOKENS)
        first_half = page_text[:split_point]
        second_half = page_text[split_point + 1:]
        first_half_translated = await async_translate(session, first_half)
        second_half_translated = await translate_page(session, second_half)
        return first_half_translated + "\n" + second_half_translated

async def translate_json_file(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)

    title_translation_dict = {}

    # Translate page titles
    title_list = [page['title'] for page in data['pages']]

    async with aiohttp.ClientSession() as session:
        translated_titles = await translate_titles(session, title_list)

        for original_title, translated_title in zip(title_list, translated_titles):
            title_translation_dict[original_title] = translated_title

        for page, translated_title in zip(data['pages'], translated_titles):
            page['title'] = translated_title

        # Translate lines with translated titles replaced
        translation_tasks = []
        for page in data['pages']:
            page_text = "\n".join(page['lines'])
            for jp_title, en_title in title_translation_dict.items():
                page_text = page_text.replace(f"[{jp_title}]", f"[{en_title}]")
            translation_tasks.append(translate_page(session, page_text))

        # Run all page translations concurrently; results come back in page order
        translated_texts = await asyncio.gather(*translation_tasks)

        for page, translated_text in zip(data['pages'], translated_texts):
            page['lines'] = translated_text.split("\n")

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

# Replace 'input.json' and 'output.json' with your desired file paths
async def main():
    await translate_json_file('input.json', 'output.json')

asyncio.run(main())

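Tried it with the new prompt. blu3mo.icon

The good

Links work properly after translation.

Issues

Bullet-list indentation still breaks.

Ah, looking at the content, the indentation was preserved, but the spaces had been replaced with \t.

Fixed the prompt. blu3mo.icon

Newlines are not preserved after translation.

Why... blu3mo.icon

There is a theory that GPT doesn't recognize newlines.

This might be it.

Bing AI was dropping newlines too, so it's plausible.

External links get translated too.

ex: the link /villagepump/ChatGPTでScrapboxTranslatorを作る gets translated, which breaks the link.

This is a hassle, so I'll leave it for now.

It seems fixable with more prompt work.

In the end, I decided to replace spaces (indentation) and newlines with \s and \n before having it translate.

This worked ✅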
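A minimal sketch of that encoding step, assuming only leading spaces need protecting and that the source text never contains a literal backslash-s or backslash-n sequence. `encode_whitespace` and `decode_whitespace` are hypothetical names, not the functions from the actual script.

```python
def encode_whitespace(text):
    """Turn leading spaces into literal \\s markers and newlines into literal \\n."""
    encoded = []
    for line in text.split("\n"):
        stripped = line.lstrip(" ")
        indent = len(line) - len(stripped)
        encoded.append("\\s" * indent + stripped)
    return "\\n".join(encoded)

def decode_whitespace(text):
    """Reverse encode_whitespace after the translation comes back."""
    decoded = []
    for line in text.split("\\n"):
        body, indent = line, 0
        while body.startswith("\\s"):
            body = body[2:]
            indent += 1
        decoded.append(" " * indent + body)
    return "\n".join(decoded)

page = "title\n item\n  nested item"
wire = encode_whitespace(page)
print(wire)  # title\n\sitem\n\s\snested item
print(decode_whitespace(wire) == page)  # True
```

The whole page travels as a single line of visible markers, so the model has nothing invisible to drop; the markers are turned back into real whitespace once the reply arrives.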

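Live commentary got tedious, so I'm stopping. blu3mo.icon

https://sharegpt.com/c/Tga6MNE

The log should be viewable here. blu3mo.icon

Fixed a bug in token counting

https://sharegpt.com/c/lmEB2ys

A human finds the bug, and the machine fixes it.

What a wonderful world. blu3mo.icon

https://github.com/openai/openai-cookbook/blob/main/examples/How_to_handle_rate_limits.ipynb

I noticed there is a rate limit, so I throttle requests with a Semaphore.

(I didn't know about Semaphore, but ChatGPT handled it.)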
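For readers who, like me, had not met Semaphore before, here is a minimal sketch of the idea. The API call is replaced by a hypothetical stand-in (`limited_call` with a short sleep), and the limit of 3 is an arbitrary assumption.

```python
import asyncio

async def limited_call(semaphore, text, stats):
    """Run one hypothetical API call while holding a semaphore slot."""
    async with semaphore:
        stats["running"] += 1
        stats["peak"] = max(stats["peak"], stats["running"])
        await asyncio.sleep(0.01)  # stand-in for the HTTP request
        stats["running"] -= 1
        return text.upper()

async def translate_all(texts, max_concurrent=3):
    # Create the semaphore inside the running event loop.
    semaphore = asyncio.Semaphore(max_concurrent)
    stats = {"running": 0, "peak": 0}
    results = await asyncio.gather(*(limited_call(semaphore, t, stats) for t in texts))
    return results, stats["peak"]

results, peak = asyncio.run(translate_all([f"page{i}" for i in range(10)]))
print(peak)  # 3
```

Note that a semaphore caps how many requests are in flight at once, not how many are sent per minute, so it only approximates a rate limit; retrying with backoff on rate-limit errors would likely still be needed.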

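For now I catch and ignore errors, but errors still pop up here and there.

Pattern where the response is empty

The token count seems to be exceeded, but I don't know why the batching isn't working.

Is the token count calculation wrong?

This tends to happen when the body text is in English. blu3mo.icon

Pattern where a mysterious text/html response comes back

Error occurred while making request: 0, message='Attempt to decode JSON with unexpected mimetype: text/html; charset=utf-8', url=URL('https://api.openai.com/v1/chat/completions')

Can't tell without looking at the contents.

It's a waste that the space \s and newline \n each use 2 tokens.

https://gyazo.com/f392ff8e6a882e5d60e55a5b045461d8

Cases like this waste a lot.

I want to find a symbol that is rarely used but costs 1 token.

☆

Something around here?

No matter what I try it doesn't stabilize, so I decided to handle it with a mechanism like "if the line count after translation differs from the line count before translation by 3 or more, translate again." blu3mo.icon

Raise the temperature and try up to three times.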
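A rough sketch of that retry rule follows. `translate` is a hypothetical callable taking the text and a temperature, and the temperature schedule (0.0, 0.2, 0.4) is an assumption, not the values used in the real script.

```python
def translate_with_retry(translate, text, max_attempts=3, tolerance=3):
    """Retry until the translated line count is within tolerance of the source.

    translate is a hypothetical callable (text, temperature) -> str.
    The temperature schedule below is an assumption, not the author's values.
    """
    source_lines = text.count("\n") + 1
    result = ""
    for attempt in range(max_attempts):
        temperature = 0.2 * attempt  # raise it a little on each retry
        result = translate(text, temperature)
        if abs((result.count("\n") + 1) - source_lines) < tolerance:
            return result
    return result  # give up and keep the last attempt

calls = []
def fake_translate(text, temperature):
    calls.append(temperature)
    # Pretend the model collapses the page into one line until the third try.
    return text if len(calls) == 3 else "summary"

source = "a\nb\nc\nd\ne"
final = translate_with_retry(fake_translate, source)
print(final == source, len(calls))  # True 3
```

Raising the temperature on each retry makes the next attempt less likely to repeat the same failure, at the cost of slightly less deterministic wording.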

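As I keep bolting on features with ChatGPT, the code gets harder and harder to read. blu3mo.icon

I wonder whether ChatGPT can refactor code. inajob.icon

todo:

Replace the first line of each page with the translated title.

For title translation, it might be fine to loop until the line count matches.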